{% extends 'base.html' %} {% block page_content %} PythonForDataScience_project_VIDAL_Sara_TESTU_Constantin

Python For Data Science

Importing the dataset:

Let's first have a look at the data to better plan the processing:

We can get a first glance at the dataset by using DataFrame.info():

We can see that the dataset is already clean: there are no NaN values and all features are floats (apart from the URL).
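As a minimal sketch of this first step, the check below uses a tiny synthetic frame standing in for the real CSV (the file name in the comment is an assumption, not confirmed by the text):

```python
import pandas as pd

# In the actual project the data would be loaded from disk, e.g.:
#   df = pd.read_csv("OnlineNewsPopularity.csv")   # file name is an assumption
# Here, a tiny synthetic stand-in with the same shape of columns:
df = pd.DataFrame({
    "url": ["http://example.com/a", "http://example.com/b"],
    "n_tokens_title": [10.0, 12.0],
    "shares": [593.0, 711.0],
})

df.info()  # prints dtypes, non-null counts, and memory usage

# Confirm the dataset is clean: count NaN values across all columns
nan_count = int(df.isna().sum().sum())
```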

Selection of the columns to be scaled depending on the variable type:

We can observe here that the distribution of the number of shares has some outliers (probably articles that went viral). This could hurt the performance of our models. We could trim the outliers (e.g. articles with more than 15K shares).

Many features have outliers. Using robust scaling (based on the median and interquartile range) might do the trick.

With the correlation matrix we observe that some groups of features are strongly correlated with one another but show little correlation with the rest of the dataset (such as the weekday features). On the other hand, the shares feature correlates very poorly with all other features.

When applying the log function to the shares, we can see slightly stronger correlations for some features. We will therefore try using log(shares) as the target, as it might improve our scores.
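The effect can be illustrated on synthetic data (entirely made up, built so that the feature relates multiplicatively to shares): the log transform turns a multiplicative relation into a roughly linear one, which Pearson correlation picks up better.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic feature with a multiplicative (exponential) link to shares
feature = rng.uniform(1, 10, 500)
shares = np.exp(0.5 * feature + rng.normal(0, 0.5, 500))

df = pd.DataFrame({"feature": feature, "shares": shares})

raw_corr = df["feature"].corr(df["shares"])           # correlation with raw shares
log_corr = df["feature"].corr(np.log(df["shares"]))   # correlation with log(shares)
# log_corr is markedly stronger than raw_corr on this kind of relation
```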

In terms of feature engineering, we will for the moment use all features, then check whether some of them are irrelevant and should be removed.
So let's split the data:

Now let's scale our data using robust scaling so that values on different scales can be compared:
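A minimal sketch of the split-then-scale step, on synthetic stand-in data (the 80/20 split ratio and random seed are assumptions). The scaler is fitted on the training set only, then applied to both splits, so no information leaks from the test set:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler

# Synthetic stand-ins for the article features and share counts
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
y = rng.integers(100, 20_000, size=200).astype(float)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit on the training data only, then transform both splits
scaler = RobustScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```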

The objective here is to predict a quantity (the number of shares an article gets). We have fewer than 100K examples and many potentially important features. We will therefore start by trying two models:

  1. Ridge Regression
  2. Random Forest

1) Ridge Regression:

2) Random Forest:

At this point the results are pretty terrible: on average, the model's predictions are off by 8564 shares.
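The two-model comparison above can be sketched as follows, on synthetic regression data (the hyperparameter values here are defaults chosen for illustration, not the ones used in the project); the mean absolute error is the "average error in shares" metric quoted above:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic regression data standing in for the article features
X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for model in (
    Ridge(alpha=1.0),
    RandomForestRegressor(n_estimators=100, random_state=0),
):
    model.fit(X_train, y_train)
    mae = mean_absolute_error(y_test, model.predict(X_test))
    print(type(model).__name__, "MAE:", round(mae, 1))
```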

Feature Engineering:

We will try to reduce the number of features and apply the log function to the target feature.

We selected the features that had the strongest correlation according to the correlation matrix.
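A minimal sketch of that selection, on a synthetic frame with two informative and two noise features (the 0.3 correlation threshold is an illustrative choice, not the one from the project):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 300

# Two features that actually drive the target, two pure-noise features
f1 = rng.normal(size=n)
f2 = rng.normal(size=n)
noise1 = rng.normal(size=n)
noise2 = rng.normal(size=n)
shares = 3 * f1 - 2 * f2 + rng.normal(scale=0.5, size=n)

df = pd.DataFrame(
    {"f1": f1, "f2": f2, "noise1": noise1, "noise2": noise2, "shares": shares}
)

# Keep features whose absolute correlation with the target exceeds the threshold
corr = df.corr()["shares"].drop("shares").abs()
selected = corr[corr > 0.3].index.tolist()
```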

Even though the mean errors seem smaller when we apply the log function to the target feature, this reduction is only due to the much smaller scale of the log-transformed values; the mean error shrinks accordingly, but the regression is not necessarily more accurate.

The results we get are terrible: there are not enough linear or logarithmic relations between the features and the target for a regression to work.

New approach: Classification

After reading the paper published by the creators of the dataset, we realized that they were actually applying classification algorithms. The target feature is numerical, but they set a threshold: every article with more shares than that threshold is considered "popular", and the other articles are considered "not popular". That way the feature takes only two different values, making it categorical.

The mean value is close to the 80th percentile value. Thus, if an article's share count is above the mean, it is higher than that of roughly 80% of the articles in the dataset.

We will therefore set our threshold at 3395 shares to sort the articles into not popular (0) and popular (1).
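The binarization itself is a one-liner; the share counts below are illustrative, while the 3395 threshold is the one chosen above:

```python
import pandas as pd

# Illustrative share counts; 3395 is the threshold defined above
shares = pd.Series([500, 1200, 3400, 8000, 20000])

# 1 = popular (strictly more than 3395 shares), 0 = not popular
popular = (shares > 3395).astype(int)
```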

Splitting the data:

Scaling the data:

We will try four different classification algorithms and see which one classifies the data best:

  1. Random Forest
  2. Adaptive Boosting Classifier
  3. K-Nearest-Neighbours (KNN)
  4. Naive Bayes

1) Random Forest:

Using accuracy alone to evaluate the performance of our model would be misleading, as the split between popular and not popular articles is 20%/80%. We will therefore need other metrics to evaluate it:

We could be satisfied with an AUC score of 70%, but we will see if other models perform better:
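The evaluation above can be sketched as follows, on synthetic data built with the same ~80/20 class imbalance (the dataset and hyperparameters are stand-ins): AUC is computed from the predicted probability of the positive class, so it is not fooled by a classifier that always predicts the majority class.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic data with roughly the same 80%/20% imbalance as the shares split
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.8, 0.2], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0, stratify=y
)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

pred = clf.predict(X_test)
acc = accuracy_score(y_test, pred)   # inflated by the majority class
f1 = f1_score(y_test, pred)          # sensitive to the minority class
# AUC scores the ranking given by the positive-class probability
auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
```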

2) Adaptive Boosting Classifier:

3) K-Nearest Neighbors:

4) Naive Bayes:

Now that we have tested each model, let us compare the different scores we got:

AdaBoost gave us the best AUC score. Let us now try to optimize it:

Tuning:

To improve our model we will tune it with a GridSearch:

It will test every combination of parameters in the grid to find which one gives the best score.
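A minimal sketch of that grid search, on synthetic imbalanced data (the grid values here are illustrative, not the project's actual grid). Scoring by AUC keeps the tuning consistent with the metric used to compare models above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic imbalanced data standing in for the article features
X, y = make_classification(
    n_samples=400, n_features=10, weights=[0.8, 0.2], random_state=0
)

# Illustrative grid; every combination below is cross-validated
param_grid = {
    "n_estimators": [50, 100],
    "learning_rate": [0.5, 1.0],
}

search = GridSearchCV(
    AdaBoostClassifier(random_state=0),
    param_grid,
    scoring="roc_auc",  # rank candidates by AUC rather than accuracy
    cv=3,
)
search.fit(X, y)

best_params = search.best_params_  # best combination found by the search
```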

Conclusion:

Despite struggling at the beginning with the regression, we eventually managed to obtain a great score with our improved AdaBoost model, thanks to scaling, model selection, and optimization of the hyperparameters with a grid search.

{% endblock %}